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Abstract 

The CUR matrix decomposition and Nystrom method are two important low-rank matrix 
approximation techniques. The Nystrom method approximates a positive semidefinite ma- 
trix in terms of a small number of its columns, while CUR approximates an arbitrary data 
matrix by a small number of its columns and rows. Thus, the CUR decomposition can be 
regarded as an extension of the Nystrom method. 

In this paper we establish a more general bound for the adaptive column/row sampling 
algorithm, based on which we propose improved CUR and Nystrom algorithms with ex- 
pected relative-error bounds. The proposed CUR and Nystrom algorithms also hold low 
time complexity and can avoid maintaining the whole data matrix in RAM. In addition, 
we give theoretical analysis for the lower bounds of the conventional Nystrom method and 
the ensemble Nystrom method. In our work we make no assumption on the data matrix, 
and our main theoretical results established in this paper are novel and encouraging. 
Keywords: large-scale matrix computation, CUR matrix decomposition, the Nystrom 
method, randomized algorithms, adaptive sampling 

1. Introduction 

Large-scale matrices emerging from stocks, genomes, web documents, web images and videos 
everyday bring new challenges in modern data analysis. Most efforts have been focused on 
manipulating, understanding and interpreting large-scale data matrices. In many cases, 
matrix factorization methods are employed for constructing compressed and informative 
representations to facilitate computation and interpretation. A principled approach is the 
truncated singular value decomposition (SVD) which finds the best low-rank approximation 
of a data matrix. Applications of SVD such as eigenface (Sirovich and Kirby, 1987, Turk and 
Pcntland, 199 J ) and latent semantic analysis (Dccrwcster et al., 1990) have been illustrated 
to be very successful. 

However, using SVD to find basis vectors and low-rank approximations has its limi- 
tations. As pointed out by Berry et al. (2005), it is often useful to find a low-rank ma- 
trix approximation which posses additional structures such as sparsity or nonnegativity. 
Since SVD or the standard QR decomposition for sparse matrices does not hold sparsity 
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in general, when the sparse matrix is large, computing or even storing such decompositions 
becomes challenging. Therefore it is useful to compute a low-rank matrix decomposition 
which preserves such structural properties of the original data matrix. 

Another limitation of SVD is that the basis vectors resulting from SVD have little 
concrete meaning, which makes it very difficult for us to understand and interpret the data 
in question. An example in Drineas et al. (2008), Malioncy and Drincas (2009) has well 
shown this viewpoint; that is, the vector [(l/2)age— (l/\/2)height + (l/2)income], the sum 
of the significant uncorrelated features from a dataset of people's features, is not particularly 
informative. Kuruvilla et al. (2002) have also claimed: "it would be interesting to try to find 
basis vectors for all experiment vectors, using actual experiment vectors and not artificial 
bases that offer little insight." Therefore, it is of great interest to represent a data matrix 
in terms of a small number of actual columns and/or actual rows of the matrix. Matrix 
column selection and the CUR matrix decomposition provide such techniques. 

1.1 Matrix Column Selection 

Column selection has been extensively studied in the theoretical computer science (TCS) 
and numerical linear algebra (NLA) communities. The work in TCS mainly focuses on 
choosing good columns by randomized algorithms with provable error bounds (Frieze et al., 
2004, Deshpande et al., 2006, Drineas et al., 2008, Deshpande and Rademacher, 2010, Bout- 
sidis ct al., 2011, Guruswami and Sinop, 2012). The focus in NLA is then on deterministic 
algorithms, especially the rank-revealing QR factorizations, that select columns by pivoting 
rules (Foster, 1986, Chan, 1987, Stewart, 1999, Bischof and Hansen, 1991, Hong and Pan, 
1992, Chandrasekaran and Ipsen, 1994, Gu and Eisenstat, 1996, Berry et al., 2005). In this 
paper we make main attention on randomized algorithms for column selection. 

Given a matrix A E M™-^"-, column selection algorithms aim to choose c columns of 
A to construct a matrix C € M™^'^ so that ||A — CC^^A||g achieves the minimum. Here 

= 2" or = F" represents the matrix spectral norm or the matrix Frobenius norm, 
and denotes the Moore-Penrose inverse of C. Since there are (") possible choices of 
constructing C, selecting the best subset is a hard problem. 

In recent years, many polynomial-time approximate algorithms have been proposed. 
Among them we are especially interested in those algorithms with relative- error upper 
bounds; that is, there exists a polynomial function f(m,n,k,c) such that with c (> k) 
columns selected from A 

||A- CC'I'AII^ < f {m,n,k,c)\\ A - AkW^ 

with high probability (w.h.p.) or in expectation w.r.t. C. We call / the approximation 
factor. The bounds are especially good when / does not depend on m and n, in which 
case the bounds are known as constant-factor bounds (Mahoney, 2011). Particularly, the 
bounds are strong if for any error parameter e there exists a function C{k,e) such that 
/(m, n, k, c) = 1+e when c > C{k, e). 

However, the column selection method, also known as the A « CX decomposition 
in some applications, has its limitations. For a large sparse matrix A, its submatrix C is 
sparse, but the coefficient matrix X G M'^^" is not sparse in general. The CX decomposition 
suffices when m ^ n, because X is small in size. However, when m and n are near equal. 
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computing and storing the dense matrix X in RAM becomes a problem. In such occasion 
the CUR matrix decomposition is a very useful alternative. 

1.2 The CUR Matrix Decomposition 

The CUR decomposition problem has been widely discussed in the literature (Goreinov 
et al., 1997a,b, Stewart, 1999, Tyrtyshnikov, 2000, Berry ct al., 2005, Drineas and Mahoney, 
2005, Mahoney et al., 2008, Bien et al., 2010), and it has been shown to be very useful in high 
dimensional data analysis (Mahoney and Drineas, 2009). Particularly, a CUR decomposition 
algorithm seeks to find a subset of c columns of A to form a matrix C G M'"^'^, a subsect 
of r rows to form a matrix R £ M**^", and an intersection matrix U G M'^^'' such that 
||A — CUR||^ is small. Accordingly, we use A = CUR to approximate A. 

Drineas et al. (200G) proposed a CUR algorithm with additive-error bound. Later on, 
Drineas et al. (200S) devised a randomized CUR algorithm which has (l+e) relative-error 
bound w.h.p. if sufficiently many columns and rows are sampled. Recently, Mackcy ct al. 
(2011) established a divide-and-conquer method which solves the CUR problem in parallel. 
Also, CUR algorithms guaranteed by (H-e) relative-error bounds are of great interest. 

Unfortunately, the existing CUR algorithms usually require a large number of columns 
and rows to be chosen. For example, for an mxn matrix A and a target rank k <C 
min{m, n}, the subspace sampling algorithm (Drineas et al., 2008) — a classical CUR algorithm — 
requires 0{ke~'^logk) columns and 0{ke~^log'^ k) rows to achieve (l+e) relative-error 
bound w.h.p. The computational cost of this algorithm is at least equal to the cost of the 
truncated SVD of A, that is, 0{mnk) in general. However, maintaining a large scale matrix 
in RAM is often impractical, not to mention conducting SVD. Though the computational 
issue was recently solved by a randomized algorithm which provides a fast approximation to 
statistical leverage scores (Drineas ct al., 2012), the error bound of the subspace sampling 
algorithm is still not satisfactory. 

The CUR matrix decomposition problem has a close connection with the column se- 
lection problem. Especially, most CUR algorithms such as those in Drineas and Kannan 
(2003), Drineas ct al. (2006, 2008) work in a two-stage manner. Moreover, Stage 1 is a 
standard column selection procedure. If the second stage is naively solved by a column 
selection algorithm on A-^, then the approximation factor will trivially be \/2/ ^- Thus, 
more sophisticated error analysis techniques for the second stage are indispensable in order 
to achieve (l+e) relative-error bound. 

1.3 The Nystrom Methods 

The Nystrom method is closely related to CUR, and it can potentially benefit from the 
advances in CUR techniques. The difference between both is that the Nystrom method is 
used for approximating positive semidefinite (p.s.d.) matrices. The method approximates 
a p.s.d. matrix only using a subset of its columns, so it can alleviate computation and 
storage problems when the p.s.d. matrix in question is large in size. In fact, the Nystrom 
method has been extensively used in the machine learning community. For example, the 

1. Since ||A - CUR|||. = ||A - CC^A + CC^A - CCtARtRIII = ||(I - CC''")A|||. + ||CC1"(A - 
ARtR)!!!, < ||A-CCtA||| + ||A- ARtRIII < 2/2||A- Afe|||,, where the second equahty follows from 
(I-CC"l')^CCt =0. 
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Nystrom method has been appUed to Gaussian processes (WilHams and Seeger, 2001), kernel 
SVMs (Zhang et al., 2008), spectral clustering (Fowlkes et al., 2004), kernel PCA (Talwalkar 
ct al., 2008, Zhang et al., 2008, Zhang and Kwok, 2010), etc. 
Given an m x m p.s.d. matrix of the form 



^21 



^22 



where W is cxc and A21 is (m— c)xc, the standard Nystrom method samples c columns of 
A to construct a matrix C G M'"^'^. Without loss of generality, we assume that the first c 
columns are selected; that is, 

" W 



^21 



Then the rank-c Nystrom approximation to A is 



W 

A21 



k-21 



^21 
WtAi^i 



(1) 



We call W'l' the intersection matrix. The matrix (Wfc)"^, where k < c and W^, is the 
best fc-rank approximation to W, is also used as an intersection matrix for constructing 
approximations with even lower rank, but using W^^ results in a tighter approximation than 
using w|, usually. 

Alternatively, the ensemble Nystrom method of Kumar et al. (2009) selects a collections 
of t samples, each sample C^*\ (i = 1, • • • ,t), containing c columns of A. Then the ensemble 
method combines the samples to construct an approximation of the form 



A ens 



where fx^^^ are the weights of the samples. Typically, the ensemble Nystrom method seeks 
to find out the weights by minimizing || A — Af'J'^ll^ or || A — A^'J.'^||2. A simple but effective 
strategy is to set the weights as fj,^^^ = ■ ■ ■ = ^(*) = i. 

Much work has been built on the upper error bounds of the sampling techniques for 
the Nystrom method. Most of the work, e.g., (Drincas and Mahoncy, 2005, Li ct al., 2010, 
Kumar et al., 2009, Jin et al., 2011, Kumar et al., 2012), studied the additive-error bound. 
With assumptions on matrix coherence, better additive bounds were obtained by Talwalkar 
and Rostamizadch (2010), .Jin ct al. (2011), Mackcy ct al. (2011). However, as stated by 
Mahoncy (2011), additive-error bounds are less compelling than relative-error bounds. In 
one recent work, Gittcns (2011) provided a relative-error bound for the first time. Later 
on, Gittens and Mahoney (2013) sharpened this result. 

However, the error bounds of the previous Nystrom methods are much weaker than 
those of the extant CUR algorithm, especially the relative-error bounds in which we are 
more interested (Mahoney, 2011). Actually, as will be proved in this paper, the lower error 
bounds of the conventional Nystrom method and the ensemble Nystrom method are even 
much weaker than the upper bounds of some extant CUR algorithms. This motivates us to 
improve the Nystrom method by borrowing the techniques in CUR matrix decomposition. 
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Table 1: Comparisons between our adaptive sampling based CUR algorithm and the best 
existing algorithm — the subspace sampling algorithm of Drincas ct al. (2008). 
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1.4 Contributions and Outline 

The main technical contribution of this work is the adaptive sampling bound in Theorem 4, 
which is an extension of Theorem 2.1 in Deshpaiidc ct al. (2006). Theorem 2.1 of Dcshpandc 
et al. (2006) bounds the error incurred by projection onto column or row space, while our 
Theorem 4 bounds the error incurred by the projection simultaneously onto column space 
and row space. Thus, Theorem 2.1 in Dcshpandc ct al. (2006) can be regarded as a special 
case of Theorem 4. It is worth pointing out that Theorem 4 is not a trivial extension of 
Theorem 2.1 in Deshpande et al. (2006) because Theorem 4 cannot be trivially proved using 
the techniques in Deshpande et al. (2006). 

More importantly, our adaptive sampling bound provides an approach for improving the 
CUR and Nystrom approximation methods: no matter which (1+e) relative-error column 
selection algorithm is employed. Theorem 4 ensures (1+e) relative-error bounds for the CUR 
and Nystrom approximation methods. We present the results in Corollary 6. 

Based on the adaptive sampling bound in Theorem 4 and its corollary 6, we provide a 
concrete CUR algorithm which beats the best existing algorithm — the subspace sampling 
algorithm — both theoretically and empirically. Our proposed CUR algorithm employs the 
near-optimal column selection algorithm of j_j()utsidis ct al. (2011) for column/row sampling. 
The CUR algorithm is described in Algorithm 2 and analyzed in Theorem 7. In Table 1 
we present a comparison between our proposed CUR algorithm and the subspace sampling 
algorithm. As we see, our algorithm requires much fewer columns and rows to achieve 
the same approximation factor (1+e). Our method is more scalable for it works on only 
a few columns or rows of the data matrix in question; by contrast, the subspace sampling 
algorithm maintains the whole data matrix in RAM to implement SVD. 

Another important application of the adaptive sampling bound is to yield an improved 
Nystrom algorithm shown in Theorem 9. The improved Nystrom algorithm is different from 
the standard Nystrom method in that it uses a different intersection matrix. Particularly, we 
use CA(C"^)^ instead of as the intersection matrix. It is obvious that using CA(C"^)^ as 
the intersection matrix makes the approximation tighter. Moreover, the improved Nystrom 
algorithm has a strong relative-error upper bound: for a target rank fc, by sampling ^ (l + 
o(l)) columns it achieves an (1+e) relative-error bound in expectation. 

Finally, we establish the lower bounds of the Nystrom methods that use W^^ as the 
intersection matrix. We show the results in Table 2. From the table we can see that the 
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Table 2: Additive-error and relative-error lower bounds of the standard Nystrom method 
and the ensemble Nystrom method. The blanks indicate the lower bounds are 
unknown to us. Here m denotes the column/row number of the p.s.d. matrix, and 
c denotes the number of selected columns. 
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upper error bound of our improved Nystrom method is even better than the lower bounds 
of the conventional Nystrom methods 

The rest of this paper is organized as follows. In Section 2 we present the notation that 
will be used in this paper. In Section 3 we survey several related work including randomized 
column selection and previous work on CUR matrix decomposition. In Section 4 we present 
our theoretical results and corresponding algorithms. In Section 5 we empirically evaluate 
our improved CUR and Nystrom algorithms. Finally, we conclude our work in Section 6. 
All proofs are deferred to the appendix. 

2. Notation 

First of all, we present the notation and notion that are used here and later. We let Im 
denote the m x ni identity matrix, 1^ denote the mxl vector of ones, and denote a zero 
vector or matrix with appropriate size. For a matrix A = [aij] £ M'"^"', we let a^*) be 
its i-th row, slj be its j-th column, Ai-j be a submatrix consisting of its i to j-th columns 
{i < j), and A.^j be a submatrix consisting of a subset of column indexed by I. 

Let p = rank(A) < min{m, n} and k < p. The singular value decomposition (SVD) of 
A can be written as 

p 

1=1 

where Ua.A: (rnxk), XlA,fc (kxk), and Va,A; (nxk) correspond to the top k singular values. 
We denote A^ = UA,A;SA,fcVj ^ which is the best (or closest) rank-A; approximation to A. 
We also use o"j(A) = aA,i to denote the i-th largest singular value. When A is p.s.d. the 
SVD is identical to the eigenvalue decomposition, in which case we have Ua = Va- 

We define the matrix norms as follows. Let ||A||i = be the £i-norm, ||A||i? = 

(Eij afj)^^^ = (Ei^A.i)^^^ t)e the Frobenius norm, ||A||2 = maXxgiR",||xH2=i l|A-x||2 = (7a,i 
be the spectral norm, and ||A||* = cta.i be the nuclear norm. We always use ||A||^ to 
represent ||A||2 or ||A||i7. 

2. This can be valid because the lower bounds in Table 2 do not hold when the intersection matrix is not 



^A,fc 







^A,A,a 
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Based on SVD, the statistical leverage scores of the columns of A relative to the best 
rank-Zc approximation to A is defined as 

4" = ||(Va,.)(^')||2, J = l,---,n. 

We have that X]j=i ^f^ ~ ^- The leverage scores of the rows of A are defined according to 
Ua,A;- The leverage scores play a crucial role in low-rank matrix approximation. Informally 
speaking, the columns (or rows) with high leverage scores have greater influence in rank-A; 
approximation than those with low leverage scores. 

Additionally, let A^^ = V^.^S^^^Uj ^ be the Moore-Penrose inverse of A (Ben-Israel 
and Greville, 2003). When A is nonsingular, the Moore-Penrose inverse is identical to 
the matrix inverse. Given matrices A G M"*^'^, X G M™^p, and Y G M^xn^ XXtA = 
UxU^A G j^mxn -g ^YiQ projection of A onto the column space of X, and AY^Y = 
AVyVy G M™^" is the projection of A onto the row space of Y. 

Finally, we discuss the computational costs of the matrix operations mentioned above. 
For an mxn general matrix A (assume m > n), it takes 0{mn?) flops to compute the full 
SVD and 0{mnk) flops to compute the truncated SVD of rank k {< n). The computation 
of A^ also takes 0{mv?') flops. It is worth mentioning that, although multiplying an mxn 
matrix by an nxp matrix runs in mnp flops, it can be easily implemented in parallel (Halko 
ct al., 2011). By contrast, operations like SVD and QR decomposition are hard to be 
implemented in parallel. So we denote the time complexity of such matrix multiplication 
by ^Multiply (mnp) , which can be tremendously smaller than 0{mnp) in practice. 

3. Previous Work 

In Section 3.1 we present an adaptive sampling algorithm and its relative-error bound 
established by Deshpande et al. (2006). In Section 3.2 we highlight the near-optimal column 
selection algorithm in Boutsidis ct al. (2011) which we will use in our CUR and Nystrom 
algorithms for column/row sampling. In Section 3.3 we introduce two important CUR 
algorithms. 

3.1 The Adaptive Sampling Algorithm 

Adaptive sampling is an effective and efficient column selection algorithm for reducing the 
approximation error. After one has selected a small subset of columns (denoted Ci), an 
adaptive sampling method is used to further select a proportion of columns according to 
the residual of the first round, that is, A — CiCj A. The approximation error is guaranteed 
to be decreasing by a factor after the adaptive sampling (Deshpande et al., 2006). We show 
the result of Deshpande ct al. (2006) in the following lemma. 

Lemma 1 (The Adaptive Sampling Algorithm, Deshpande et al. (2006)) Given a 
matrix A G M*"^"", we let Ci G W^^^^ consist of ci columns of A, and define the residual 
B = A — CiC| A. Additionally, for i = 1, • • • , n, we define 

Pi = ||bi||2/||B||2,. 
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We further sample C2 columns i.i.d. from A, in each trial of which the i-th column is chosen 
with probability Pi. Let C2 G M™-><^2 contain the C2 sampled columns and let C = [Ci, C2] G 
]^mx(ci+c2)_ Then, for any integer k > 0, the following inequality holds: 

E||A-CC"fA||| < ||A- Afc||| + — ||A-CiC1a|||, 

C2 

where the expectation is taken w.r.t. C2. 

We will establish a more general and more useful error bound for this adaptive sampling 
algorithm in Theorem 4. It can be shown that Lemma 1 is actually a special case of 
Theorem 4. 

3.2 The Near-Optimal Column Selection Algorithm 

Recently, Boutsidis ct al. (2011) proposed a randomized algorithm which selects only c = 
2fce~^ (1+0(1)) columns to achieve the expected approximation factor (1+e). BoTitsidis 
et al. (2011) also proved the lower bound of the column selection problem; that is, at least 
c = ke~^ columns are selected to achieve the (1+e) ratio. Thus this algorithm is near 
optimal. Though an optimal algorithm recently proposed by GTiruswami and Sinoj) (2012) 
attains the the lower bound, this algorithm is quite inefficient in comparison with the near- 
optimal algorithm. So we prefer to use the near-optimal algorithm in our CUR and Nystrom 
algorithms for column/row sampling. 

The near-optimal algorithm includes three steps: the approximate SVD via random 
projection (Halko ct al., 2011, Boutsidis ct al., 2011), the dual set sparsification algo- 
rithm (Boutsidis et al., 2011), and the adaptive sampling algorithm (Deshpande et al., 
2006). We describe them in Algorithm 1 and present the theoretical analysis in Lemma 2. 

Lemma 2 (The Near-Optimal Column Selection Algorithm) Given a matrix A G 
jgmxn ^^^^ Q target rank k {2 <k < p), and < e < 1. Algorithm 1 selects 




columns of A to form a matrix C € W^^'^ , then the following inequality holds: 

E||A-CCtA|||. < {l + e)\\A-Ak\\l, 

where the expectations are taken w.r.t. C. Furthermore, the matrix C can be obtained in 
©(mfc^e-^/^ + nfc^e"^/^) + TMuitipiy (w-^-^e"^''^) time. 

This algorithm has the merits of low time complexity and space complexity. None of the 
three steps — the randomized SVD, the dual set sparsification algorithm, and the adaptive 
sampling — requires loading the whole of A into RAM. All of the three steps can work on only 
a small subset of the columns of A. Though a recently proposed algorithm in Guruswami 
and Sinop (2012) requires even fewer columns to achieve the (1+e) relative-error bound, it 
is less efficient than the near-optimal algorithm. 
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Algorithm 1 The Near-Optimal Column Selection Algorithm in Boutsidis et al. (2011). 

1: Input: a real matrix A G M™^", target rank fc, e e (0, 1], target column number c—^ (l+o(l)) ; 

2: Compute approximate truncated SVD via random projection such that A^ w U/jS^Vfe; 

3: Construct U -s— columns of (A — U/jS^Vfe); V ^ columns of V|"; 

4: Compute s -s— Dual Set Spectral-Frobenius Sparsification Algorithm {U, V, c— 2fc/e); 

5: Construct Ci ADiag(s), and then delete the all-zero columns; 

6: Residual matrix D ^ A — CiC|A; 

7: Compute sampling probabilities: Pi = ||di||2/||D|||,, i — 1, - ■ ■ ,n; 

8: Sampling C2 ~ 2k/e columns from A with probability {pi, ■ ■ ■ ,p„} to construct C2; 

9: return C = [Ci,C2]. 



3.3 The Previous Work in CUR Decomposition 

We introduce in this section two highly effective CUR algorithms: one is deterministic and 
the other is randomized. 

3.3.1 The Sparse Column- Row Approximation (SCRA) 

Stewart (1999) proposed a deterministic CUR algorithm and called it the sparse column- 
row approximation (SCRA). SCRA is based on the truncated pivoted QR decomposition 
via a quasi Gram-Schmidt algorithm. Given a matrix A G M"*^", the truncated pivoted 
QR decomposition procedure deterministically finds a set of columns C G M"*^'^ by column 
pivoting, whose span approximates the column space of A, and an upper triangular matrix 
Tc G M'^^'^ that orthogonalizes those columns. SCRA runs the same procedure again on A"^ 
to select a set of rows R G M^^" and the corresponding upper triangular matrix Tr, G M''^''. 
Let C = QcTc and R"^ = QrTr denote the resulting truncated pivoted QR decomposi- 
tion. The intersection matrix is computed by U = (T^Tc)~"^C'^AR'^(T^Tr)~^ Accord- 
ing to our following experiments, this algorithm is quite effective but very time expensive, 
especially when c and r are large. Moreover, this algorithm does not have data-independent 
error bound. 

3.3.2 The Subspace Sampling CUR Algorithm 

Drincas et al. (2008) proposed a two-stage randomized CUR algorithm which has a relative- 
error bound w.h.p. In the first stage the algorithm samples c columns of A to construct 
C, and in the second stage it samples r rows from A and C simultaneously to construct 
R and W and let U = W^^. The sampling probabilities in the two stages are proportional 
to the leverage scores of A and C, respectively. That is, in the first stage the sampling 
probabilities are proportional to the squared ^2-iiorm of the rows of Va,A;; in the second 
stage the sampling probabilities are proportional to the squared ^2-norm of the rows of Uc . 
That is why it is called the subspace sampling algorithm. Here we give the main results of 
the subspace sampling algorithm in the following lemma. 

Lemma 3 (The Subspace Sampling CUR Algorithm) Given a matrix A G M"^^" 

and an target rank k <^ min{m,n}, the subspace sampling algorithm selects c = 0{ke^'^logklog{l/5)) 
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columns and r = Oice ^logclog(l/(5)) rows without replacement. Then 

||A-CUR||i7 = ||A-CWtR||^ < (l + e)||A- AfelliT, 

holds with probability at least 1 — 5, where W contains the rows of C with scaling. The 
running time is dominated by the truncated SVD of A, i.e., 0{mnk). 

Although the algorithm is e-optimal with high probability, it requires too many rows to 
be chosen, that is, at least r = 0{ke~^log'^ k) rows in expectation. In this paper we seek 
to devise an algorithm with mild requirement on the numbers of columns and rows to be 
chosen. 

4. Main Results 

We now present our main results. Particulary, we establish a new bound for the adaptive 
sampling algorithm in Section 4.1. Based on this adaptive sampling bound, we devise an 
improved CUR algorithm in Section 4.2 and an improved Nystrom method in Section 4.3. 
In Section 4.4 we study lower bounds for the conventional Nystrom method to demonstrate 
the advantages of our approach. Finally, In Section 4.5 we show that our expected bounds 
can extend to "with high probability" bounds. 

4.1 Adaptive Sampling 

The relative-error adaptive sampling algorithm is originally established in Theorem 2.1 of 
Deshpande et al. (2006) (see also Lemma 1 in Section 3.1). The algorithm is based on the 
following idea: after selecting a proportion of columns from A to form Ci by an arbitrary 
algorithm, the algorithm randomly samples additional C2 columns according to the residual 
A — CiC| A. Here we prove a new and more general bound for the same adaptive sampling 
algorithm. 

Theorem 4 (The Adaptive Sampling Algorithm) Given a matrix A G M™^" and a 
matrix C G M"*^^ such that rank(C) = rank(CCtA) = p {p<c<n). We let Ri G M'^i^" 
consist of ri rows of A, and define the residual B = A — AR|Ri. Additionally, for 
i = 1, - ■ ■ ,m, we define 

p, = ||b»i/||B|||. 

We further sample r2 rows i.i.d. from A, in each trial of which the i-th row is chosen with 
probability pi. Let R2 G W"^^^ contain the r2 sampled rows and let R = [Rf,R|^]^ G 

EIIA- CC"fAR"fR||| < ||A-CC"fA||?, + — IIA- AR|Ri||?,, 
where the expectation is taken w.r.t. R2. 

Remark 5 This theorem shows a more general bound for adaptive sampling than the orig- 
inal one in Theorem 2.1 of Deshpande et al. (2006). The original one bounds the error 
incurred by projection onto the column space of C, while Theorem 4 bounds the error in- 
curred by projection onto the column space of C and row space of R simultaneously — such 
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situation rises in problems such as CUR and the Nystrom approximation. It is easy to 
check that Theorem 2.1 of Deshpande et al. (2006) is a direct corollary of this theorem when 
C = Afc. It is worth pointing out that Theorem 4 is not a trivial extension of Theorem 2.1, 
because the proof of Theorem 4 is not immediately obtained from the proof of Theorem 2.1 
in Deshpande et al. (2006). 

The adaptive sampling bound in Theorem 4 provides a useful tool for improving the 
CUR matrix decomposition and the Nystrom approximation. Guaranteed by Theorem 4, 
any column selection algorithm with (1+e) relative-error bound can be applied to CUR and 
the Nystrom approximation. We show the result in the following corollary. 

Corollary 6 (Adaptive Sampling for CUR and the Nystrom Approximation) Given 
a matrix A G M™^", a target rank k (<C m,n), and a column selection algorithm Acoi which 
achieves (1+e) relative-error upper bound by selecting c > C{k,e) columns. Then we have 
the following results for CUR and the Nystrom approximation. 

(1) By selecting c > C{k,e) columns of A to construct C and ri = c rows to construct 
Ri, both using algorithm Acoh followed by selecting additional r2 = c/e rows using the 
adaptive sampling algorithm to construct R2, the CUR matrix decomposition achieves 
an (1+e) relative- error upper bound in expectation: 

E||A-CUR||^ < (l + e)||A- Afcll^, 

where R = [Rf ,R^]^ and U = CtARt. 

(2) Suppose A is an m X m symmetric matrix. By selecting ci > C{k,e) columns of 
A to construct Ci using Acol cind selecting C2 = ci/e columns of A to construct C2 
using the adaptive sampling algorithm, the Nystrom approximation achieves an (1 + e) 
relative- error upper bound in expectation: 

E||A-CUC^||^ < (l + e)||A- Afcll^, 
where C = [Ci, C2] and U = CtA(Ct)^. 

Based on Corollary 6 as well as the adaptive sampling bound, we attempt to improve 
CUR and Nystrom algorithms. We present the concrete procedures in Section 4.2 and 4.3. 

4.2 Adaptive Sampling Based CUR Algorithm 

Guaranteed by the novel adaptive sampling bound in Theorem 4, we apply the column selec- 
tion algorithm of Boutsidis et al. (2011) to the CUR problem, obtaining an algorithm with 
a much tight theoretical bound than the existing algorithms. The algorithm is described in 
Algorithm 2 and its analysis is given in Theorem 7. Theorem 7 follows immediately from 
Lemma 2 and Corollary 6. 

Theorem 7 (An Adaptive Sampling Based CUR Algorithm) Civen a matrix A G 
j^mxn g positive integer k ^ min{?Ti, n}, the CUR algorithm described in Algorithm 2 
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Algorithm 2 Adaptive Sampling Based CUR Algorithm. 



1: Input: a real matrix A g M™^", target rank fc, e G (0, 1], target column number c—"^ (l+o(l)) , 

target row number r = |(1 + e); 
2: Select c = ^(l + o(l)) columns of A to construct C e M™^^ using Algorithm 1; 
3: Select ri = c rows of A to construct Ri e M''^^" using Algorithm 1; 
4: Adaptively sample ^2 = c/e rows from A according to the residual A — AR|Ri; 
5: return C, R = [Rf ,R^]^, and U = C^ARt. 



randomly selects c = — (l+o(l)) columns of A to construct C G M™^'^, and then selects 
r = |(l+e) rows of A to construct R G M^^". Then we have 

E||A-CUR||f = E||A- C(CtARt)R||^ < (1 + e)|| A - Afe||i.. 

Moreover, the algorithm runs in time C'((m+n)A;^e~^/'^+mA;^e^^+nA;^e^^)+TMultiply ("T-^^^e"^) • 

Since k,c,r <^ min{m, n} by the assumption, the time complexity of the improved CUR 
algorithm is linear in the size of A, that is, mxn, which is lower than that of the full SVD 
of A. Another advantage of this algorithm is that it can avoid loading the whole mxn data 
matrix A into RAM. Neither the near-optimal column selection algorithm nor the adaptive 
sampling algorithm requires loading the whole of A into RAM. The most space-expensive 
operation throughout this improved CUR algorithm is computation of the Moore-Penrose 
inverses of C and R, which requires maintaining an mxc matrix or an rxn matrix in RAM. 

Remark 8 // we replace the near-optimal column selection algorithm in Theorem 7 by the 
optimal algorithm of Guruswami and Smop (2012), then selecting totally c = A:e~"^(l -|-o(l)) 
columns and r = ce~^{l + e) rows suffices. But the optimal algorithm is less efficient than 
the near- optimal algorithm. 



4.3 Adaptive Sampling Based Nystrom Algorithm 

Theorem 4 also provides an approach for bounding approximation errors incurred by pro- 
jection simultaneously onto column space and row space. We therefore apply this approach 
to improve the Nystrom methods. Different from the conventional Nystrom methods where 
one uses rows of C to compute the intersection matrix, i.e., W defined in (1), we use 
C^A(C^)^ instead. The following theorem follows directly from Lemma 2 and Corollary 6. 

Theorem 9 (An Adaptive Sampling Based Nystrom Algorithm) Given a symmet- 
ric p.s.d. matrix A G ]^™x™ and a target rank k, with ci = ^(l + o(l)) columns sampled 
by Algorithm 1 and C2 = ci/e column sampled by the adaptive sampling algorithm, that is, 
with totally c = ^(l + o(l)) columns being sampled, the approximation error incurred by 
the improved Nystrom method is upper bounded by 



E||A-CUC^||^ < E 



< (l + e)||A-Afc||F. 

F 



A-C(^CtA(Ct)^j 

The algorithm runs in 0{mk^e-'^ + mk^e-'^/^) 

+ ^Multiply (^^^e ^) time 
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Table 3: Additive-error and relative-error lower bounds of the standard Nystrom method 
and the ensemble Nystrom method. The blanks indicate the lower bounds are 
unknown to us. 



Additive-Error 


||A-A||j=. IIA-AII2 ||A-A||, 
maxi j \aij\ max^ y |fli/| max; \aij\ 


Standard 
Ensemble 


0.99v'm-c-fc + fc(«)^ ^^Mjli+M 0.99(m-c)(l + ^) 
0.99^ {m - 2c + f - fc) + kr-Z!,r'T - 0.99(m - c)(l + ^) 



Relative-Error 


A— A F A— A 2 A— A , 

IIA-Afcllf l|A-Afc||2 ||A-Afc|U 


Standard 
Ensemble 


/t I m^k—c^ m m—c ( -[ | k\ 
\l c^(m-k) c rn-k \^ ^ c ) 


ni-2c+c/t-kf. k{'m-2c+c/t)\ m-c /i , k\ 
\ m—k I J m—k\ c) 



Remark 10 The relative- error hound in Theorem 9 is the only Frobenius norm constant- 
factor bound for the Nystrom approximation problem at present. If one uses the optimal 
column selection algorithm of Guruswami and Sinop (2012), which is less efficient, then the 
error bound is further improved: only c = ^(1 -|- o(l)) columns are required. 

We will show in the next subsection that our improved Nystrom algorithm is better 
than any other Nystrom methods that use W^^ as the intersection matrix. We will show 
that the conventional Nystrom methods, either the standard or the ensemble method, have 
a bottleneck pointed out in Theorem 11. Our improved Nystrom method overcomes the 
bottleneck by simply using an intersection matrix C"fA(C^^)"^ instead of W"^. 

4.4 Lovi^er Bounds of the Nystrom Method 

We now demonstrate to what an extent our improved Nystrom method is superior over 
the conventional ones. We show in Theorem 11 the lower error bounds of the conventional 
Nystrom methods. The results are summarized in Table 2. In other words, the conventional 
Nystrom methods do not work better than the lower bounds unless additional assumptions 
are made on the original matrix A. 

Theorem 11 (Lower Bounds of the Nystrom Method) Assume we are given ap.s.d. 
matrix A G M™^"*. Let A^ denote the best rank-k approximation to A. Let A denote either 
the rank c approximation to A constructed by the standard Nystrom method, or the approxi- 
mation constructed by the ensemble Nystrom method with t samples, each of which contains 
c columns of A. Then there exists a p.s.d. matrix such that for any sampling strategy the 
approximation errors of the conventional Nystrom methods, that is, || A — A||^, = 2, F, 
or "*"), are lower bounded by some factors which are shown in Table 3. 

Remark 12 The lower bounds in Table 3 (or Table 2) show the conventional Nystrom 
methods can he sometimes very ineffective. The spectral norm and Frobenius norm bounds 
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Table 4: A summary of the datasets for CUR matrix decomposition. 
Dataset Type Size ^Nonzero Entries Source 

Enron Emails twrt 39,861 x 28, 102 3,710,420 Bag-of-words, UCI 

Dexter text 20, 000 x 2, 600 248, 616 Guyou ct al. (2004) 

Farm Ads text 54, 877 x 4, 143 821, 284 Mesterharm and Pazzani (2011) 

Gisette handwritten digit 13, 500 x 5, 000 8, 770, 559 Guyon et al. (2004) 



even depend on m, so such bounds are not even constant-factor bounds. Notice that the 
lower error bounds do not meet ifW^ is replaced by C^^A(C"f)-^, so our improved Nystrom 
method is not limited by such lower bounds. 

4.5 Discussions of the Error Bounds 

The upper error bounds established in this paper all hold in expectation. Now we show that 
the expected error bounds immediately extend to w.h.p. bounds using Markov's inequality. 
Let the random variable X = ||A — A||ir/||A — A^Hi;' denote the relative-error, where 
A = CUR or CUC^. Then we have EX < 1 + e by the preceding theorems. By applying 
Markov's inequality we have that 

, , EX 1 + e 

F(X>l + se) < < 

^ ' 1 + se l + se 

Repeating the sampling procedure for t times and letting correspond to the relative 
error of the i-ih sample, we obtain 

p(|min{X(i)} > 1 + se) = P(^X(i) > 1 + se Vi = 1, • • • , < (j^^)* = 

which decays exponentially with t. Thus our CUR and Nystrom algorithms based on the 
adaptive sampling are also guaranteed with w.h.p. relative-error bounds by repeating the 
sampling procedure multiple times and choose the sample that minimizes ||A — P^\\f- Let 
s = 1 + log(l/(5), by repeating the sampling procedure t > 1 + 1/e times, the inequality 

||A-A||f < (l + e + elog(l/5)) ||A- Afclli. 

holds with probability at least 1 — 5. Equivalently, let s = 2, by repeating the sampling 
procedure t > (1 + 1/e) log(l/5) times, the inequality 

||A- A||i7 < (l + 2e) ||A- AfcllF 

holds with probability at least 1 — 5. 

5. Empirical Analysis 

We conduct empirical comparison among the CUR algorithms in Section 5.1 and the 
Nystrom algorithms in Section 5.2. We report the approximation error of each algorithm 
on each data set. The relative error is defined by 

1 II A — A.\\f 

Relative Error = — — — , 

||A - Ak\\F 
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where A = CUR for the CUR matrix decomposition, A = CW^C^ for the conventional 
Nystrom method, A = C(C^A(C^)"^)C"^ for the improved Nystrom method. 

We conduct experiments on a workstation with two Intel Xeon 2.40GHz CPUs, 24GB 
RAM, and 64bit Windows Server 2008 system. We implement the algorithms in MAT- 
LAB R2011b, and we use the MATLAB function 'svds' for truncated SVD. To com- 
pare the running time, all the computations are carried out in a single thread by setting 
'maxNumCompThreads(l)' in MATLAB. 
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Figure 1: Empirical results of the CUR algorithms on the Enron dataset. 



5.1 Comparison among the CUR Algorithms 

In this section we empirically compare our adaptive sampling based CUR algorithm in The- 
orem 2 with the subspace sampling algorithm of Drineas et al. (2008) and the deterministic 
sparse column-row approximation (SCRA) algorithm of Stewart (1999). We implement the 
subspace sampling algorithm and our adaptive sampling based CUR algorithm in MATLAB; 
as for SCRA, we use the MATLAB code released by Stewart (1999). 

We conduct experiments on four UCI datasets (Frank and Asuncion, 201U) which are 
summarized in Table 4. Each dataset is actually represented as a data matrix, upon which 
we apply the CUR algorithms. According to our analysis, the target rank k should be far 
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less than m and n, and the column number c and row number r should be strictly greater 
than k. For each dataset and each algorithm, we set A: = 10 or 50, and c = ak, r = ac, where 
a ranges in each set of experiments. We repeat each of the two randomized algorithms 10 
times and report the minimum relative error and the total elapsed time of the 10 rounds. 
We report the relative errors and the elapsed time of the three CUR matrix decomposition 
algorithms in Figures 1, 2, 3, and 4. 
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Figure 2: Empirical results of the CUR algorithms on the Dexter dataset. 

We can see from Figures 1, 2, 3, and 4 that our adaptive sampling based CUR algorithm 
has much lower approximation error than the subspace sampling algorithm in all of the 
experiments. Our adaptive sampling based algorithm is better than the deterministic SCRA 
on the Farm Ads dataset and the Gisette dataset, worse than SCRA on the Enron dataset, 
and comparable to SCRA on the Dexter dataset. In addition, the experimental results well 
match our theoretical analysis in Section 4. The empirical results all obey the theoretical 
relative-error upper bound 

IIA- CUR||i7 2k , , 2 , , 

iiA-A.iir - = 1+^(1 

As for the running time, the subspace sampling algorithm and our adaptive sampling 
based algorithm are much more efficient than SCRA, especially when c and r are large. Our 
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Figure 3: Empirical results of the CUR algorithms on the Farm Ads dataset. 



adaptive sampling based algorithm is comparable to the subspace sampling algorithm when 
c and r are small; however, our algorithm becomes less efficient when c and r are large. 
This is due to the following reasons. First, the computational cost of the subspace sampling 
algorithm is dominated by the truncated SVD of A, which is determined by the target rank 
k and the size and sparsity of the data matrix. However, the cost of our algorithm grows 
with c and r, thus our algorithm becomes less efficient when c and r are large. Second, the 
truncated SVD operation in MATLAB, that is, the 'svds' function, gains from sparsity, but 
our algorithm does not. The four datasets are all very sparse, so the subspace sampling 
algorithm has advantages. Last but not least, the truncated SVD functions are very well 
implemented by MATLAB (not in MATLAB language but in Fortran/C). By contrast, 
our algorithm is implemented in MATLAB language, which is usually less efficient than 
Fortran/C. 



5.2 Comparison among the Nystrom Algorithms 

In this section we empirically compare our adaptive sampling based algorithm in Theo- 
rem 9 with some other sampling algorithms including the subspace sampling in Driiicas 
ct al. (2008) and the uniform sampling, both without replacement. We denote the selected 
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columns by C = A.^j G 



where X is an index set. 



method (denoted by 'conventional' in the figures) forms 
section matrix, while the improved method (denoted by 
as the intersection matrix. 



Then the conventional Nystrom 
U = (Cx,.)t G M^x^ as the inter- 
'improved') uses U = CtA(Ct)^ 
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Figure 4: Empirical results of the CUR algorithms on the Gisette dataset. 

We test the algorithms on three datasets which are summarized in Table 5. For each 
dataset we generate radial basis function (RBF) kernel matrix A which is defined by 



Aij = exp 



X7 X7 



2(j2 



where Xj and Xj are data instances and cr is a parameter defining the scale of the kernel. 
Notice that the RBF kernel is dense in general. We set a = 0.2 or 1 in our experiments. For 
each dataset with different settings of a, we fix a target rank k = 10, 20 or 50 and vary c in 
a very large range. We run each algorithm for 10 times and report the the minimum relative 
error and the total elapsed time of the 10 repeats. The results are shown in Figures 5, 6, 
and 7. 

The effects of a (of the RBF kernel) on the spectrum and leverage scores are discussed 
in Section 3.1 in Gittcns and Mahoncy (2013): as a increases, the spectrum of the RBF 
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kernel decays more rapidly and the leverage scores become more uniform. The effects of a 
have two important implications. On one hand, when a is large, the RBF kernel can be well 
approximated by a low-rank matrix (see Table 5), which implies that (i) a small k suffices 
when a is large, and (ii) k should be set large when a is small. So the results with (cr = 1, 
k = 10) and (cr = 0.2, k = 50) are more interesting than the rest. On the other hand, when 
a is small, the leverage scores become heterogeneous, thus the effect of choosing "good" 
columns is significant. Figures 5, 6, and 7 show that the advantage of subspace sampling 
and adaptive sampling over uniform sampling is significant whenever the standard deviation 
of the leverage scores are large (see Table 5), and vise versa. 

The experimental results also show that the subspace sampling and adaptive sampling 
algorithms significantly outperform the uniform sampling when c is reasonably small, say 
c < 10. This indicates that the subspace sampling and adaptive sampling algorithms are 
good at choosing "good" columns as basis vectors. Furthermore, the experimental results 
show that using U = C^^A(C^)'^ as the intersection matrix (denoted by "improved" in the 
figures) leads to much lower relative error than using U = W^^ (denoted by "conventional"). 
When all of the algorithms use U = C^A(C^)^ as the intersection matrix, our adaptive 
sampling based algorithm achieves the lowest approximation error in most cases. The 
relative errors of our adaptive sampling based Nystrom algorithm are in accordance with 
the theoretical bound in Theorem 9, that is. 



As for the running time, our algorithm is more time efficient than the subspace sampling 
algorithm. This is partly because the RBF kernel matrix is dense and thus the truncated 
SVD is slow. 

6. Conclusion 

In this paper we have built a novel and more general relative-error bound for the adaptive 
sampling algorithm. Accordingly, we have devised novel CUR matrix decomposition and 
Nystrom approximation algorithms which significantly improve the classical counterparts. 
Our improved CUR algorithm requires only c = 2/ce~^(l + o(l)) columns and r = ce~^(l+e) 
rows selected from the original matrix to achieve (1+e) relative-error bound. To achieve 
the same bound, the best previous algorithm — the subspace sampling algorithm — requires 
c = 0{ke~'^ log A;) columns and r = 0{ce~'^ logc) rows. Our improved Nystrom method is a 
little different from the conventional Nystrom method in that it uses a different intersection 
matrix. We have shown that our method achieves (1+e) relative-error upper bound by 
sampling only c = 2A;e~^(l+o(l)) columns, which even beats the lower error bounds of the 
conventional Nystrom method. Our improved CUR algorithm and the improved Nystrom 
algorithm are scalable because they need only to maintain a small fraction of columns 
or rows in RAM, and their time complexity is linear in the matrix size, that is, mxn. 
Finally, the empirical comparison has also demonstrated the effectiveness and efficiency of 
our algorithms. 



A - CUC^ 



A-Ak\\F 



< 1 + 



^j'ii^+oii)) = 1+^^(1+0(1)) 
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Figure 5: Empirical results of the Nystrom algorithms on the RBF kernel of the Abalone 
dataset. 
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(f) a = 1, k = 50, and c — ak. 



Figure 6: Empirical results of the Nystrom algorithms on the RBF kernel of the Wine 
Quality dataset. 
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(a) a = 0.2, k = 10, and c = ak. (b) a = 0.2, k = 20, and c = ak. (c) cr = 0.2, k = 50, and c = ak. 
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(d) (T = 1, fc = 10, and c = ak. 



(e) a — \, k — 20, and c = ak. 



(f) (T = 1, fc = 50, and c = ak. 



Figure 7: Empirical results of the Nystrom algorithms on the RBF kernel of the Letters 
dataset. 
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Table 5: A summary of the datasets for the Nystrom approximation. In the second tabular 
std(£['^l) denotes the standard deviation of the statistical leverage scores of A 
relative to the best rank-A; approximation to A. We use the normalization factor 
m/k because ymean(£W) = 1. 



Dataset ^Instances ^Attributes Source 

Abalone 4, 177 8 UCI (. rank and Asuncion, 2010) 

Wine Quality 4,898 12 UCI (Cortez et al., 2009) 

Letters 5,000 16 Statlog (Michie et al., 1994) 







l|A 


- Afc||f/||A||f- 




f std(^''=<) 








fc = 10 


fc = 20 


fc = 50 


fc = 10 


fc = 20 


fc = 50 


Abalone (a = 


0.2) 


0.4689 


0.3144 


0.1812 


0.8194 


0.6717 


0.4894 


Abalone (a = 


1.0) 


0.0387 


0.0122 


0.0023 


0.5879 


0.8415 


1.3830 


Wine Quality (cr = 0.2) 


0.8463 


0.7930 


0.7086 


1.8703 


1.6490 


1.3715 


Wine Quality (cr = 1.0) 


0.0504 


0.0245 


0.0084 


0.3052 


0.5124 


0.8067 


Letters (cr = 


0.2) 


0.9546 


0.9324 


0.8877 


5.4929 


3.9346 


2.6210 


Letters (cr = 


1.0) 


0.1254 


0.0735 


0.0319 


0.2481 


0.2938 


0.3833 



Appendix A. The Dual Set Sparsification Algorithm 

For the sake of self-containing, we attach the dual set sparsification algorithm and describe 
some implementation details. The dual set sparsification algorithm established in Boutsidis 
ct al. (2011) is deterministic. The improved CUR algorithm 2 calls the dual set spectral- 
Frohenius sparsification algorithm in Lemma 13 of Boutsidis ct aJ. (2011) in the both stages. 
We show this algorithm in Algorithm 3 and its bounds in Lemma 13. 

Lemma 13 (Dual Set Spectral- Frobenius Sparsification) Let U = {xi, -- ,x„} C 
M' (/ < n) contain the columns of an arbitrary matrix^ G M'^". LetV = {vi, • • • , v„} C M'^ 
{k < n) be a decompositions of the identity, i.e., 'Ylll=i^i^J ~ -'-fc- Given an integer r with 
k < r < n, Algorithm 3 deterministically computes a set of weights Sj > (i = 1, - ■ ■ ,n) at 
most r of which are non-zero, such that 

^k(^^SiVivf^ > (^l - J-^ and tr ^ SjXjxf ^ < ||X|||. 

1=1 i=l 

The weights Si can be computed deterministically in Of^rnk"^^ + ^Multiply (^0 ^^"^6. 

We would like to mention some implementation issues of Algorithm 3 which were not 
described in detail by Boutsidis ct al. (2011). In each iteration the algorithm performs 
once eigenvalue decomposition: A^- = WAW-^. Here A,- is guaranteed to be p.s.d. in each 
iteration. Since 

(a, - alk)" = WDiag((Ai - a)", • • • , (A^ - «)<?) W^, 

we can efficiently compute (A,- — {Lr + 1)1^)'' based on the eigenvalue decomposition of A,-. 
With the eigenvalues at hand, (l){L, A-^) can also be computed directly. 
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Algorithm 3 Deterministic Dual Set Spectral-Frobenius Sparsification Algorithm. 
1: Input: U = {xjf^i C M', [l < n); V = {vJ:[Li C M^ with Y.7=i v»vf ^ Ik (k < n); k < r < 

2: Initialize: So = 0, Aq = 0; 

3: Compute ||xi||| for i — 1, - ■ ■ ,n, and then compute Sjj = ^i^si-|=k; 
for r = to r — 1 do 



Compute the eigenvalue decomposition of A,-; 

Find any index j in {1, • • • , n} and compute a weight t > such tliat 



vf(A^-(i^ + l)Ifc) Vj 1 
^^^""-^■ll^ ^ ^ .^(x; + l,A.)-0(LiA.) -^"(^-(^- + ^)^O 



where 



''^ -1 
0(i:,A) =^(a,(A)-l) , Lr = T-V^; 



Update the j-th component of and A,-: St-+i[7] = Sr[j] + t, Ar+i = A^ + ^^j'^J 
end for 

return s = — s^. 



The algorithm runs in r iterations. In each iteration, the eigenvalue decomposition of A,- 
requires 0{k^), and the n comparisons in Line 6 each requires 0{k'^). Moreover, computing 
||xj||| for each Xj requires TMuitipiy (^^O- Overall, the running time of Algorithm 3 is at most 
0(r/c3) + 0{rne) + rMuitipiy(raO = ©(rnfc^) + TwuitipiylnO. 

The near-optimal column selection algorithm described in Lemma 2 has three steps: ran- 
domized SVD via random projection which costs ©(m/c^e"^/^) + TMuitipiy ("^'^^e^^^^) time, 
the dual set sparsification algorithm which costs ©(nfc^e^^/^) + TMuitipiy ("i"-) time, and the 
adaptive sampling algorithm which costs ©(m/c^e"^/^) + ^Multiply (^T^nA;e~^'''^) time. There- 
fore, the near-optimal column selection algorithm costs totally 0[mk'^e~^^^ + nk^e~'^^^) + 
Tuviitipiyi'mnke''^/^) time. 

Appendix B. Proofs of The Adaptive Sampling Bounds 
B.l The Proof of Theorem 4 

Theorem 4 can be equivalently expressed in Theorem 14. In order to stick to the column 
space convention throughout this paper, we prove Theorem 14 instead of Theorem 4. 

Theorem 14 (Adaptive Sampling Algorithm) Given a matrix A G ]^mx" and a ma- 
trix R G M*"^" such that rank(R) = rank(AR"l"R) = p {p < r < m), let Ci G R'^^^'^i consist 
of ci columns of A, and define the residual B = A — CiC|A. For i = 1, - ■ ■ ,n, let 

Pi = \\hig/\\B\\l, 

where hi is the i-th column of the matrix B. Sample further C2 columns from A in C2 i.i.d. 
trials, where in each trial the i-th column is chosen with probability pi. Let C2 G W^^'^'^ 
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contain the C2 sampled columns and C = [Ci, C2] G K»"x(ci+C2) contain the columns of both 
Ci and C2, all of which are columns of A. Then the following inequality holds: 

E||A-CC^AR"fR||| < ||A- ARtR|||, + -^||A- CiCIaII^. 
where the expectation is taken w.r.t. C2. 

Proof With a little abuse of symbols, we use bold uppercase letters to denote random 
matrices and bold lowercase to denote random vectors, without distinguishing between 
random matrices/vectors and non-random matrices/vectors. 

We denote the j-th column of V^j^tR.p G W^^^ as Vj, and the (i, j)-th entry of V^j^tR,p 
as Vij. Define random vectors G such that for j = 1, • • • , n and I = 1, - ■ ■ , C2, 

X,- (I) = —hi = (a.i — CiCtaj ) with probability pi, for i = 1, • • • , n, 

Pi Pi ^ ' 

Note that is a linear function of a column of A sampled from the above defined 

distribution. We have that 

n 

Vi4 , 



'J' 

Pi 

1=1 ^* 



i=l * 1=1 " " "-l^ 

Then we let Xj = ^ Yl'iLi have 



E[x,-] 



E| 










2 1 


E 




= — E 






2 C2 



2 _ 1 

2 C2 



-Ellx,- c;-! — Bv 



According to the construction of xi,--- ,Xp, we define the C2 columns of A to be C2 G 
]^mxc2_ ]\jote that all the random vectors xi • • • , Xp lie in the subspace span(Ci) + span(C2). 
We define random vectors 

Wj = CiCj AR'I'Rvj + Xj = CiC|Avj + Xj, for j = 1, • • • , p, 

where the second equality follows from Lemma 15; that is, AR^Rvj = Avj if Vj is one of 
the top p right singular vectors of AR^^R. Then we have that any set of random vectors 
{wi, • • • , Wp} lies in span(C) = span(Ci) + span(C2). Let W = [wi, • • • , Wp] be a random 
matrix, we have that span(W) C span(C). The expectation of Wj is 

E[wj] = CiCjAvj + E[xj'] = CiC| Avj + Bvj = Avj, 
therefore we have that 

Wi-Avj=Xj-Bvj. 
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The expectation of ||wj — Av 



Ellw,- - Av 



j\\2 



Jll2 

Ellx 



IS 



Bv 



< 



1 

C2 
1 

C2 
1 

C2 



-Ellx, 



B||^. 



To complete the proof, we denote 



j\\2 



1 

C2 



Bv 



j\\2 



|2--(Bv,fE[x,-(,)] + -||Bv 



C2 



C2 



iBv 



j\\2 



C2 



IBI 



C2 



2 

7112 



C2 



IBv 



ill2 



(2) 



F = E^g"'wg<)ARtR, 

9=1 

where cxg is the g-th largest singular value of AR^R and Uq is the corresponding left singular 
vector of AR^^R. The column space of F is contained in span(W) (c span(C)), and thus 

II AR+R - CC+AR'^RllI < || AR+R - WWtARl"R||| < || AR+R - F|||.. 

We use F to bound the error || ARtR - CCtARtR||2,. That is, 

|2 



E||A-CCtARtR||^ 



E|| A - AR"fR + AR+R - CC"fAR"fR||> 



E 



I A - ARtRlPp + II ARI'R - CCtAR+RllI, 



(3) 



< ||A - AR+Rlll, + E||AR1'R- F|||,, 



where (3) is due to that A(I - R+R) is orthogonal to (I - CCt)ARtR. Since AR^R and 
F both lie on the space spanned by the right singular vectors of AR'I'R (i.e., {vj}^^^), we 
decompose AR"''R — F along {vj}^^^, obtaining that 



EIIA- CC'^AR'^Rlll < 



< 



A 
A 



AR+R 
AR+R 

A - ARI^R 

A - ARtR 

A - ARtR 
A - ARtR 



|, + E||ARtR-F||^, 

p 

|, + ^E (ARtR-F)vj 

i=i 

p p 
I + ^E||ARtRv, - (^a-iw,u^)a,-uj 

3 = 1 <?=! 
P 11 2 

]p + ^E ARtRvj - Wj 

i=i 



F + 5^IE||Avj-w,-| 

i=i 



l^ + ^l|B| 

C2 



2 



(4) 
(5) 



where (4) follows from Lemma 15 and (5) follows from (2). 
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Lemma 15 We are given a matrix A G and a matrixH G W^"^ such t/iai rank(AR^^R) = 

rank(R) = p {p < r < m). Letting Vj G M" he the j-th top right singular vector of AlOU, 
we have that 

AR"fRvj = Avj, for j = !,■■■ , p. 

Proof First let ^n.,p G W^^p contain the top p right singular vectors of R. Then the 
projection of A onto the row space of R is AR^R = AVR^pV^ p. Let the thin SVD of 
AVr,p G W^^'p be USV^, where V G W^. Then the compact SVD of AR+R is 

ARtR = AVr,,V^,^ = USV^V^,^. 

According to the definition, Vj is the j-th column of (VR^pV) G W^^p. Thus vj lies on 
the column space of V^^p, and Vj is orthogonal to Vk^p_l. Finally, since A - ARtR = 
AVrp^V^p_i_, we have that Vj is orthogonal to A — AR^R, that is, (A — ARtR)vj = 0, 
which directly proves the lemma. ■ 



B.2 The Proof of Corollary 6 

Proof Since C is constructed by columns of A and the column space of C is contained 
in the column space of A, we have rank(CCtA) = rank(C) = p < c. Consequently, the 
assumptions of Theorem 4 are satisfied. The assumptions in turn imply 

||A-CCtA||^ < (l + e)||A-Afc||,., 
||A-Ar1Ri||^ < (l + e)||A- AfclliT, 

and c/ri = e. It then follows from Theorem 4 that 



ErIIA- CCtARtRlPp 



|A - CCtARtR||2 



Ri 



< ERj||A-CCtA||| + -^||A- AR|Ri||| 

L r2 

< ||A-CCtA||^ + -(l + e)||A- Afclll 



r2 



|A - CCtAII?, + e(l + e)||A - A 



A; ||_F- 



Furthermore, we have that 



EIIA - CURIIf 



< E||A-CUR|||, = Ec,r||A - CCtARtRllI 



= Ec 
< Ec 



E- 



R 



|A - CCtARtR||2, 



|A - CCtAlPp + e(l + e)||A - A 



kUF 



< (l + e)2||A- Afclll, 



which yields the error bound for CUR matrix decomposition. 
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The matrix A is symmetric, so Cj consists of the rows A, and thus we can use Theo- 
rem 14 (which is identical to Theorem 4) to prove the error bound for the Nystrom approx- 
imation. By replacing R in Theorem 14 by C^, we have that 



E A 



CCtA(Ct)^Cr||t 



< 



If 



Cl 



A(Cl)^C{ ||; + -1|A 



1 + 



Cl 
C2 



|A- CiCiA 



C2 
|2 



CiClAll?, 



where the expectation is taken w.r.t. C2. Together with the inequality 

||A-CCtA(Ctf C^ll^ < \\A-CC^A{C\fcJ\\l 
given by Lemma 16, we have that 



E 



Cl,C2 



|A-CCtA(Ctf C^llJ, < Ec„cjA-CCtA(Clf Cfll' 



Cl 



1 + ^ EcjA-CiClA| 



C2^ 

(l+e)2||A-Afc"' 



Hence E||A-CCtA(Ct)^C^||^ < [e|| A-CCtA(Ct)^C^||5, 



< (l+e)||A-AJ 



Lemma 16 Given an mxm matrix A and an mxc matrix C = [Ci,C2], the following 
inequality holds: 

||A-CCtA(Ct)^C^||^ < \\A-CC^A{C\fc'{\\l. 

Proof Let Vc-^ = CC^A denote the projection of A onto the column space of C, and 
'Pc = Im — CC^ denote the projector onto the space orthogonal to the column space of C. 
It has been shown by Halko et al. (2011) that, for any matrix A, if span(M) C span(N), 
then the following inequalities hold: 

IIT'mAII^ < IIPnAII^ and H^PmAH^ > WP^Mk- 

Accordingly, AV^t = AR^R is the projection of A onto the row space of R € M*"^". We 
further have that 

\\A-VcAVl\\l = ||A-PcA + PcA-PcApg||| 

= llPcA + PcAPglll = \\PcA\\l + WVcAV^Wl 

and 

\\A-VcAV^jl = IIA-PcA + PcA-PcAT'gjil. 

= \\PcA + VcAp^jl = \\PcA\\l + \\VcAp^jl, 

where the last equalities follow from Vc -L Pc- Since span(Ci) C span(C), we have 
1 1 Vc AP'^^ I ll' > 1 1 T^c AP^ 1 1|, , which proves the lemma. ■ 
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B.3 The Proof of Theorem 7 



Proof The error bound follows directly from Lemma 2 and Corollary 6. The near-optimal 
column selection algorithm costs 0{mk'^e''^/^ + n/c^e^^/^) + ^Multiply ("t-^^^e^^^'^) time to 
construct C and 0(nA;^e~^/^+mfc^e~^/^) +?Muitipiy ("infce"^/^) time to construct Ri. Then 
the adaptive sampling algorithm costs ©(nfc^e"^) +7Muitipiy ("T-'^^e"^) time to construct R2. 
Computing the Moore-Penrose inverses of C and R costs 0{mc^) + 0{nr^) = 0{mk'^e~^ + 
n/c^e~^) time. The multiplication of C'^'AR"'' costs TMuitipiyC^i^-c) = TMu\tip\y{i^nke~^) time. 
So the total time complexity is C'((m+n)/!;^e~^/^+mA;^e~^+n/c^e~^) +TMvL\tip\y {i^^nke'^) . ■ 



B.4 The Proof of Theorem 9 

Proof The error bound follows immediately from Lemma 2 and Corollary 6. The near- 
optimal column selection algorithm costs 0(m/c^e~'^/^ + m/c^e~^/^) + ^Multiply (?Ti^A;e~^/^) 
time to select ci = 0{ke^^) columns of A construct Ci. Then the adaptive sampling 
algorithm costs 0[mk'^e~^) + TMuitipiy ("T-^^e "'^) time to select C2 = 0{ke~^) columns con- 
struct C2. Finally it costs 0{mc^) + TMuitipiyC^T-^c) = 0{mk'^€~^) + JMuitipiy ("i^^e"^) 
time to construct the intersection matrix U = C^^A(Cl^)^. So the total time complexity is 
0{mk'^e-^ + mk^e-^/^) + rMuitipiy(m^fce~^)- ■ 



Appendix C. Proofs of The Lower Bounds 

In Section C.l we construct the adversarial cases which will be used throughout this section. 
In Section C.2 we prove the lower bounds of the standard Nystrom method. In Section C.3 
we prove the lower bounds of the ensemble Nystrom method. Theorems 19, 20, 21, 23, and 
24 are used for proving Theorem 11. 

C.l Construction of The Adversarial Cases 

C.1.1 The Adversarial Case for The Spectral Norm Bound 

We construct an mxm positive definite matrix B as follows: 



B = (1 - a)Im + almlm = 



where a G [0, 1). It is easy to verify x'^Bx > for any nonzero x G M™. We show some 
properties of B in Lemma 17. 



1 

a 



a 
1 



a 
a 



W Bi^i 



B21 B 



22 



(6) 



a a 
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Lemma 17 Let be the best rank-k approximation to the matrix B defined in (6). Then 
we have that 

W^Wf = \Jm?(^ + m(\ — q2), ||B — Bfclli? = Vm — k{l — a), 
||B||2 = 1 + ma — a , ||B — Bfc||2 = 1 — a, 

||B||* = m, ||B — Bfcll* = {m — k){l — a), 

where 1 < k < m — 1. 

Proof The squared Frobenius norm of B is 

IIBIll' = ^^Oij = m + (m^ — m)a^. 

id 

Then we study the singular values of B. Since B is p.s.d., here we do not distinguish 
between its singular values and eigenvalues. 

The spectral norm, i.e., the largest singular value, of B is 

||B||2 = o"! = Ai = max x"^Bx = max (1 — a)||x||2 + a(l^x)^ = 1 — a + ma, 

l|x||2<l llxi|2<l 

where the maximum is attained when x = -^1^. Thus ui = -^1^ is the top singular 
vector of B. Then the projection of B onto the subspace orthogonal to ui is 

Bi_L=B-Bi = B-CJiUiuf = -{mim - Iml-m)- 

m 

Then for all j > 1, the j'-th top eigenvalue aj and eigenvector u^, i.e., the singular value 
and singular vector, of B satisfy 

ajUj = Buj = Bij_Uj = —— [muj - {l^Uj)lrn) = ^ (mUj-0), 

where the last equality follows from Uj _L ui, i.e., l^Uj = 0. Thus aj = 1 — a, and 

||B-Bfc||2 = cjfc+i = l-a 
for all 1 < k < m. Finally we have that 

k 



|B-Bfe||2, = \\B\\l-^a^ = (m-A;)(l-a)2, 

i=l 

||B — Bfcll* = {m — k)(j2 = {m — k){l — a), 

m 

||B||* = ^^fj = (1 + ma — q) + (m — 1)(1 — a) = m, 



1=1 

which complete our proofs. 
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C.1.2 The Adversarial Case for The Frobenius Norm and Nuclear Norm 
Bounds 



Then we construct another adversarial case for proving the Frobenius norm and nuclear 
norm bounds. Let B be a p x p matrix with diagonal entries equal to one and off-diagonal 
entries equal to a. Let m = kp and we construct an. m x m block diagonal matrix A as 
follows: 



BlkDiag(B,--- ,B) 



k blocks 



B 
B 








B 



(7) 



Lemma 18 Let A^ he the best rank-k approximation to the matrix A defined in (7). Then 
we have that 



ai(A) 
o-fc+i(A) 
|A - Afcll^ 
l|A- Afcll 



(Tfc(A) = I + pa - a, 
crm(A) = l-a, 



(1 — a)V m — k, 
(1 — a)(m — fc). 



Lemma 18 can be easily proved using Lemma 17. 



C.2 Lower Bounds of The Standard Nystrom Method 

Theorem 19 For an m x m matrix B with diagonal entries equal to one and off- diagonal 
entries equal to a ^ [0, 1), the approximation error incurred by the Nystrom method is lower 
hounded by 



|B-B,"^«|L > (^_c) 1 + 



m + c + 



\-B-By\ 



|B-B"^1 



> 



c + 



l-a 



> (m — c)(l — a) 



1 + CQ 

1 + ca — a 



Furthermore, the matrix (B — Be ) is p.s.d. 

Proof The matrix B is partitioned as in (6). The residual of the Nystrom approximation 
is 

||B-B°yi5 = ||B22-B2iWtBi;||5, (8) 

where ^ = 2, F, or *. Since W = (1 — a)lc + alcl'^ is nonsingular when a G [0, 1), so 
W^^ = W~^. We apply the Sherman-Morrison- Woodbury formula 



(A + BCD 



1-1 



-1 



A^B(C^ + DA^B)^DA 



-1 
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to compute W''^, yielding 

= — Ic-7 ^ Ml 

1 — a (1 — a)(l — a + ca) 

According to the construction, B21 is an (m—c) x c matrix with ah entries equal to a, it 
follows that B2iWtB^^ is an (m— c)x(m— c) matrix with all entries equal to 

4 a^l^Wtl, = . (9) 

\ — a + ca 

Then we obtain that 

B22-B2iWtBi; = (l-a)I^_e + (a-r?)l^_eC_e. (10) 

It is easy to check that < a < 1, thus the matrix {l—a)lm-c + {a — rj)lm-c^Tn-c is p.s.d., 
and so is (B — B"'"^). 

Combining (8) and (10), we have that 

\\-B-^T\\l = ||(l-a)I™_, + (a-r/)l„_,l^_J^ 

= (m— c) (l— ??)^ + ^(m— c)^ — (m— c)^ (a— ??)^ 

= (m-c)(l-a)^(l + ^^|±^^), (11) 

which proves the Frobenius norm of the residual. 

Now we compute the spectral norm of the residual. Based on the results above we have 
that 

||B-B°y^||2 = ||(l-a)I™-c + (a-??)lm-cl^_c||2- 
Similar to the proof of Lemma 17, it is easily obtained that ;^^==^lm_c is the top singular 
vector of the p.s.d. matrix (1 — a)Im-c + (a — ry)lm-clm-c' the top singular value is 

(1 -a)fm+ 

ai(B-Br) = {m-c){a-v) + l-a = (12) 

which proves the spectral norm bound because ||B — Bc'"^||2 = <ti(B — Bc^*^). 
It is also easy to show the rest singular values obey 

a2{B-BT) = ■■■ = am-c{B-BT) > 0, 
a^_,+i(B-B°y^) = ••• = cT^(B-B°y«) = 0. 

Thus we have, for i = 2, - ■ ■ ,m — c, 

. iiB-Brii^-.;(B-Br) ^ 

^ ^ m - c - 1 
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The nuclear norm of the residual (B — B"^*^) is 



|B-B°y^l|, 



X:a(B-Br) 



i=l 



ai(B-B^y'^) + (m-c-l)(T2(B-B 
(m — c)(l — 7]) 

(.m-c){l-a)(l+ ^ 



r) 



c + 



(13) 



The theorem follows from equalities (11), (12), and (13). ■ 

Now we use the matrix A constructed in (7) to show the Frobenius norm and nuclear 
norm lower bound. The bound is stronger than the one in Theorem 19 by a factor of k. 



Theorem 20 For the m x m p.s.d. matrix A defined in (7), the error incurred by the 
conventional Nystrom method is lower bounded by 



|A-CWtcn|^ > {l-a)^m-c-k + ^^-^j^, 
||A-CWtC^||^ > {l-a){m-c)[l + -^^^ 



where k < m is an arbitrary positive integer. 

Proof Let C consist of c column sampled from A and Cj consist of Cj columns sampled 
from the i-th block diagonal matrix in A. Without loss of generality, we assume Cj consists 
of the first q columns of B, and accordingly Wj consists of the top left q x Cj block of B. 
Thus C = BlkDiag(Ci,-- - , C^) and W = BlkDiag(Wi, • • • ,Wfc). 



AT = CW+C 



^1 


Ci 
( 







Wi 



wj 







C.W'C 



k^k 



(14) 
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Then it follows from Theorem 19 that 



i=l 



Bp-^.)(l-a)^(l + ''^t;fJlT 

i=l + Q j 

k 

p + c 



£2 

i=l » 



(1 — a)^ f m — c — k + c, 



A: 
i=l 



where j3 = p + and q = Cj + ^-j^. Since XlLi Cj = c + ^-^k = c, the term Yli=i ^ is 
minimized when ci = • • • = Cfc. Thus c^^ = k^ = k^c~'^. Finally we have that 



k 

\A-A^y'\\l = {l-af(m-c-k+p^Yj^i^ 
> {I- a)'^(^m- c- k + 



i=l 

n / n,i m -1 



by which the Frobenius norm bound follows 

Since the matrices B — ' 
is also p.s.d. We have that 



Since the matrices B — CiWiC?" are all p.s.d. by Theorem 19, so the matrix (A — Ac^^) 



|A-A°y^|| = V||B-CiWjcf|| 

1=1 

k 



1=1 



, , ,m c , / 1 

> (l-aU(— -- 1 + — J 

- ^ ' ^k k'\ ck + ^ 



c/k + 

= (i-«)("^-^)(i + rTiE^)' 

where the former inequality follows from Theorem 19, and the latter inequality follows by 
minimizing w.r.t. ci, • • • , subjecting to ci + • • • + = c. ■ 



Theorem 21 (Relative-error lower bounds of the Nystrom method) There exists 
an mxm p.s.d. matrix A such that the relative- error ratios of the Nystrom method are 
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lower bounded by 



> W1 + 



A-Afc||^ V c^{m-k)' 



|A-CWtC^| 



2 ^ m 



||A-Aa;||2 c' 

A-CWtC^IU m-cf k 
> r 1 + 



II A — Afcll* m — k\ c 

where k < m is an arbitrary positive integer. 

Proof For the spectral norm bound we use the matrix A constructed in (6) and set a — )• 1, 
then it fohows directly from Lemma 17 and Theorem 19. For the Frobenius norm and 
nuclear norm bounds, we use the matrix A constructed in (7) and set q — t- 1, then it follows 
directly from Lemma 18 and Theorem 20. ■ 



C.3 Lower Bounds of The Ensemble Nystrom Method 

Lemma 22 Assume that the ensemble Nystrom method selects a collection of t samples, 
each sample C^*^ (i = 1, - ■ ■ ,t) contains c columns o/B without overlapping. For anmxm 
matrix B with all diagonal entries equal to one and off-diagonal entries equal to a £ [0, 1), 
the Frobenius norm and nuclear norm errors incurred by the ensemble Nystrom method are 
lower bounded by 



,c \\F 



> (l-a)J m-2c+- 1 + 



m + f + 



t ' a 



|B-Bf-||^ > (l-a)(m 



c + 



c + 



l-a ■ 



where Bf'^" = ^ Ya.=i C^^^W^^^^C^*)^. Furthermore, the matrix (B - Bl").') is p.s.d. 



Proof We use the matrix B constructed in (6). It is easy to check that W^^) = • • • = w(*), 
so we use the notation W instead. We assume that the samples contain the firs tc columns 
of B and each sample contains neighboring columns, that is. 



If a sample C contains the first c columns of B, then 



(15) 



B21 BaiWtBi^^ 





B22 B2iWtB|^i 



and B-CW^C^ = 

otherwise, after permuting the rows and columns of B — CW^^C"^, we get the same result: 

n(B-cwtc-)n- . B-n(cwtc-)n- = [ « B22-B2,wtB^, 
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10 20 30 40 50 60 70 GO 00 100 



Figure 8: An illustration of the matrix B — BfJ.*^ for the ensemble Nystrom method, where 
B is defined in (6). Here we set m = 100, c = 20, a = 0.8, and t = 3. For the 
ensemble Nystrom method without overlapping, the matrix B — can always 
be partitioned into four regions as annotated. 



where 11 is a permutation matrix. As was shown in equation (9), B2iW^B|']^ is an 
(m— c)x(m— c) matrix with all entries equal to 



I — a + ca 



Based on the properties of the matrix B — C^^^W^^^^C*^*-*"^, we study the values of the 
entries of B — BfJ*^. We can express it as 



B-B?- 



B - ^ ^ C« W«^C»'^ = ^ E (b - C» WtC«^) , (16) 

1=1 i=l 



and then a discreet examination reveals that B — Bj"*^ can be partitioned into four kinds of 
regions as illustrated in Figure 8. We annotate the regions in the figure and summarize the 
values of entries in each region in the table below. (Region 1 and 4 are further partitioned 
into diagonal entries and off-diagonal entries.) 



Region 



1 (diag) 1 (off-diag) 2 3 4 (diag) 4 (off-diag) 



tc tc? — tc (tc)^— t(? 2tc(m ~tc) m — tc (m — tc)^ — (m — tc) 
i^{a-n) '-^{a-v) ^(Q-r?) l-y a-y 



^Entries 
Value 
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Now we do summation over the entries of B — Bj°** to compute its squared Frobenius 
norm: 



B-rjCnsll^ 
I ~^t,c\\F 



tc 



t - 1 



- z 

{1-1]) H h [(m - tcf - (m - tc)] (q - -qf 



[1 - a)(l + a - 2r?)(m - 2c+ -) + (a - ri)^{4c - Acm + + 



2cm — 3c^ 
i 



(l-a)^(|m-2c+^) + 



> (l-a)^m-2c + 



l-a \2 
a ' 



[m 



c, / 2 X c(m — c) 

2c+- (--2 + m) + ^ 

t a t 



1 + 



m+ f + ^ - 2 

t a 



C + 



> f m-2c + 



where the last inequahty follows from '^^ = | ^(m — 2c + |) + (c — |^ 

Furthermore, since the matrices B — C*-*-* W^C^*^^ are all p.s.d. by Theorem 19, so their 
sum is also p.s.d. Then the p.s.d. property of (B — B|°**) follows from (16). Therefore, the 
nuclear norm of (B — Bf'J.'*) equals to the matrix trace, that is. 



|B-B?-| 



tr(B-Br/) 

tc ■ (1 — T]) + {m — tc) ■ {1 — rj) 



(1 — a)(m — c) 



c + 



c + 



l-a ' 



which proves the nuclear norm bound in the theorem. 



Theorem 23 (Additive-error lower bound of the ensemble Nystrom method) Assume 
that the ensemble Nystrom method selects a collection of t samples, each sample C*-*^ 
(i = 1, ■ ■ ■ ,t) contains c columns of A without overlapping. For a the matrix A defined in 
(7), the approximation error incurred by the ensemble Nystrom method is lower bounded by 



1-^ ^t,c \\p 



> (l-a) 



m — 2c+- — k] + k 



m — c + J + k 



l-a \ 2 



c + k 



A-Af-L > (l-a)(m-c) 



c+^k 



c + 



l-a 



k' 



where Af\ 



1 Y^l_-^ c(*)w(*)'^c(*)^. 



Proof According to the construction of A in (7), the i-th sample C*-*^ is also block diagonal. 
We denote it by C^*) = BlkDiag(cS*\ • • • , cj,*^). Akin to (14), we have 



* ens 



lEUcr^wI(c^ 
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Thus the approximation error of the ensemble Nystrom method is 

t 



k 

= E 


\b-] 

k 




> (1- 





i=l 



f + - - 2 

t a 



m 



P+T + 



, , 2(1-0)-, 



(Ci + 



l-a \2 



where the inequahty fohows from Lemma 22, and the last equality follows from X]j=i '^j ~ ^ 
and kp = m. The summation in the last equality equals to 

k 



E 



Ci 2(1 - a) 

P+- + 

t a 



2 Ci + 



1 — a 



a 



, Cj , 2(l-a) 



2 



]-^N2 



r • -I- 



m-c + f + A;i^"^ 2 



c + A; 



Here the inequality holds because the function is minimized when ci 
Finally we have that 

c 



Ck = c/k. 



A-A?- 



> {l-aY 



m — 2c + - — k] + k 



m - c+ J + k ^ 



l-a \ 2 



c + k 



l-a 



which proves the Frobenius norm bound in the theorem. 

Furthermore, since the matrix B - \ cfw]{cff is p.s.d. by Lemma 22, so 
the block diagonal matrix (A — A.ff) is also p.s.d. Thus we have 



A-A?- 



(l-a)j;(p-c.)-^^ > (l-a)(m-c)(l + — ^), 



i=l 



Ci + 



c+^k 



which proves the nuclear norm bound in the theorem. 



Theorem 24 (Relative-error lower bound of the ensemble Nystrom method) Assume 
that the ensemble Nystrom method selects a collection of t samples, each sample C*-*-* 
(i = 1, • • • ,t) contains c columns of A without overlapping. Then there exists an mxm 
p.s.d. matrix A such that the relative-error ratio of the ensemble Nystrom method is lower 
bounded by 

\\A - A^^^^Wf ^ lm-2c + c/t-k k{m - 2c + c/t) 



i|A — Afclli^' V m — k V ' 

|A- - Af^J'W^ ^ m- c /-^ _^ ^ 
||A — Afc||=K ~ m — k \ c 
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where Af'^J' = \ ^^^^ c(*)W»^C»^. 

Proof The theorem follows directly from Theorem 23 and Lemma 18 by setting a — t- 1. ■ 
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